Abstract

^ In this work we consider the task of relaxing the i.i.d. assumption in
pattern recognition (or classification), aiming to make existing learning
algorithms applicable to a wider range of tasks. Pattern recognition is
guessing a discrete label of some object based on a set of given examples
(pairs of objects and labels). We consider the case of deterministically
defined labels. Traditionally, this task is studied under the assumption
that examples are independent and identically distributed. However, it
turns out that many results of pattern recognition theory carry over to a
weaker assumption, namely conditional independence and identical
distribution of objects, while the only assumption on the distribution of
labels is that the rate of occurrence of each label should be above some
positive threshold.

We find a broad class of learning algorithms for which estimates of the
probability of a classification error obtained under the classical i.i.d.
assumption can be generalised to similar estimates for the case of
conditionally i.i.d. examples.



^ Parts of the results were reported at the International Conference on
Machine Learning, 2004, and at the 15th International Conference on
Algorithmic Learning Theory, 2004.
1 Introduction



Pattern recognition (or classification) is, informally, the following task. There 
is a finite number of classes of some complex objects. A predictor is learn- 
ing to label objects according to the class they belong to (i.e. to classify), 
based only on some examples (labelled objects). One of the typical practical 
examples is recognition of a hand-written text. In this case, an object is 
a hand-written letter and a label is the letter of an alphabet it represents. 
Other examples include DNA sequence identification, recognition of an illness 
based on a set of symptoms, speech recognition, and many others. 

The formal model of the task used most widely is described, for example,
in |2Hl! and can be briefly introduced as follows (we will later refer to it as
"the i.i.d. model"). The objects $x \in X$ are drawn independently and
identically distributed (i.i.d.) according to some unknown (but fixed)
probability distribution $P(x)$. The labels $y \in Y$ are given for each
object according to some (also unknown but fixed) function^ $\eta(x)$. The
space $Y$ of labels is assumed to be finite (often binary). The task is to
construct the best predictor for the labels, based on the data observed,
i.e. actually to "learn" $\eta(x)$.

This task is usually considered in either of the following two settings. In
the off-line setting a (finite) set of examples is divided into two finite
subsets, the training set and the testing set. A predictor is constructed
based on the first set and then is used to classify the objects from the
second. In the online setting a predictor starts by classifying the first
object with zero knowledge; then it is given the correct label and (having
"learned" this information) proceeds with classifying the second object, the
correct second label is given, and so on.
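
The online protocol can be sketched in a few lines of Python. This is only an illustration: the names `online_errors` and `majority_label`, the real-valued objects and the toy data stream are assumptions made for the example and do not come from the text.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (object, label); real-valued objects are an assumption


def online_errors(
    predictor: Callable[[List[Example], float], int],
    stream: Sequence[Example],
) -> int:
    """Run the online protocol and count the mistakes made along the way."""
    history: List[Example] = []
    mistakes = 0
    for x, y in stream:
        guess = predictor(history, x)  # predict before the label is revealed
        mistakes += int(guess != y)
        history.append((x, y))         # then "learn" the correct label
    return mistakes


def majority_label(history: List[Example], x: float) -> int:
    """Toy predictor: ignore x and output the most frequent label seen so far."""
    ones = sum(label for _, label in history)
    return int(2 * ones > len(history))


stream = [(0.1, 0), (0.9, 1), (0.8, 1), (0.7, 1)]
print(online_errors(majority_label, stream))
```

In the off-line setting the same predictor would instead be called with a fixed training set as `history` for every test object.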

There are plenty of algorithms developed for solving pattern recognition
tasks (see [TUII2H1IIS] for the most widely used methods). However, the i.i.d 
assumption, which is central in the model, is too tight for many applications. 
It turns out that it is also too tight for a wide range of methods developed 
under the assumptions of the model: they work nearly as well under weaker 
conditions. 

First consider the following example. Suppose we are trying to recognise a
hand-written text. Obviously, letters in the text are dependent (for example,
we strongly expect to meet "u" after "q"). This seemingly implies that
pattern recognition methods cannot be applied to this task, which is,
however, one of their classical applications.

^ Often (e.g. in [28]) a more general situation is considered: the labels are
drawn according to some probability distribution $P(y|x)$, i.e. each object
can have more than one possible label.

We show that the following two assumptions on the distribution of examples
are sufficient for pattern recognition. First, the only dependence between
objects is that between their labels; second, the type of object-label
dependence does not change in time.

These intuitive ideas lead us to the following model (to which we refer
as "the conditional model"). The labels $y \in Y$ are drawn according to
some unknown (but fixed) distribution over the set of all infinite sequences
of labels. There can be any type of dependence between labels; moreover, we
can assume that we are dealing with any (fixed) combinatorial sequence of
labels. However, in this sequence the rate of occurrence of each label should
stay above some positive threshold. For each label $y$ the corresponding
object $x \in X$ is generated according to some (unknown but fixed)
probability distribution $P(x|y)$. All the rest is as in the i.i.d. model.

The main difference from the i.i.d. model is that in the conditional model
we made the distribution of labels primary; having done that, we can relax
the requirement of independence of objects to conditional independence.

In this work we provide a tool for obtaining estimates of the probability
of error of a predictor in the conditional model from an estimate of the
probability of error in the i.i.d. model. The general theorems about extending
results concerning performance of a predictor to the conditional model are
illustrated on two classes of predictors.

First, we extend weak consistency results concerning partitioning and
nearest neighbour estimates from the i.i.d. model to the conditional model.

Second, we use some results of Vapnik-Chervonenkis theory to estimate
performance in the conditional model (on a finite amount of data) of
predictors minimising empirical risk, and also obtain some strong consistency
results.

These results are obtained as applications of the following rule. The only
assumption on a predictor under which it works in the new model as well as
in the i.i.d. model is what we call tolerance to data: in any large dataset
there is no small subset which strongly changes the probability of error.
This property should also hold with respect to permutations. This assumption
on a predictor should be valid in the i.i.d. model. Thus, the results
achieved in the i.i.d. model can be extended to the conditional model; this
concerns distribution-free results as well as distribution-specific ones, and
results on the performance on finite samples as well as asymptotic results.
Various approaches to relaxing the i.i.d. assumption in learning tasks
have been proposed in the literature. Thus, in [T71 Hn] the authors study the 
nearest neighbour and kernel estimators for the task of regression estimation 
with continuous regression function, under the assumption that labels are 
conditionally independent given their objects, while objects form any indi- 
vidual sequence. Another approach is considered in [20], where a regression 
estimation scheme is proposed which is consistent for any individual stable 
sequence of object-label pairs (no probabilistic assumptions), assuming that 
there is a known upper bound on the variation of regression function. 

There are also several approaches in which different types of assumptions 
on the joint distribution of objects and labels are made; then the authors 
construct a predictor or a class of predictors, to work well under the as- 
sumptions made. Thus, in and a generalisation of PAC approach 
to Markov chains with finite or countable state space is presented. The
estimates of probability of error are constructed for these cases, under the
assumption that the optimal rule generating examples belongs to a
pre-specified class of decision rules. There is also a track of research on
prediction under
the assumption that the distribution generating examples is stationary and 
ergodic. The basic difference from our learning task, apart from different 
probabilistic assumption, is in that we are only concerned with object-label 
dependence, while in predicting ergodic sequences it is label-label (time se- 
ries) dependence that is of primary interest. On this task see |21 ISl 122] 
and references therein. Another approach is taken in [3] where the PAC
model is generalised to allow concepts changing over time. Here the method- 
ology is proposed to track time series dependences, that is the authors find 
some classes of dependences which can be exploited for learning. Again the 
difference with our approach is that we try to find a (broad) class of problems 
where the time series dependence can be ignored by any reasonable pattern 
recognition method rather than constructing methods to use some specific 
dependences of this kind. 

2 Definitions and General Results 

Consider a sequence of examples $(x_1, y_1), (x_2, y_2), \ldots$; each example
$z_i := (x_i, y_i)$ consists of an object $x_i \in X$ and a label
$y_i := \eta(x_i) \in Y$, where $X$ is a measurable space called an object
space, $Y := \{0, 1\}$ is called a label space and $\eta : X \to Y$ is some
deterministic function. For simplicity we made the assumption that the space
$Y$ is binary, but all results easily extend to the case of any finite space
$Y$. The notation $Z := X \times Y$ is used for the measurable space of
examples. Objects are drawn according to some probability distribution
$\mathbf{P}$ on $X^\infty$ (and labels are defined by $\eta$). Thus we
consider only the case of deterministically defined labels (that is, the
noise-free model); in Section 5 we discuss possible generalisations.

The notation $\mathbf{P}$ is used for distributions on $X^\infty$ while the
symbol $P$ is reserved for distributions on $X$. In the latter case $P^\infty$
denotes the i.i.d. distribution on $X^\infty$ generated by $P$.
Correspondingly we will use symbols $\mathbf{E}$, $E$ and $E^\infty$ for
expectations over spaces $X^\infty$ and $X$. Letters $x, y, z$ (with indices)
will be used for elements of spaces $X, Y, Z$ correspondingly, while letters
$X, Y, Z$ are reserved for random variables on these spaces.

The traditional assumption about the distribution $\mathbf{P}$ generating
objects is that examples are independently and identically distributed
(i.i.d.) according to some distribution $P$ on $X$ (i.e.
$\mathbf{P} = P^\infty$).

Here we replace this assumption with the following two conditions. 

First, for any $n \in \mathbb{N}$ and for any measurable set $A \subset X$

\[ \mathbf{P}(X_n \in A \mid Y_n, X_1, Y_1, \ldots, X_{n-1}, Y_{n-1}) = \mathbf{P}(X_n \in A \mid Y_n) \tag{1} \]

(i.e. some versions of conditional probabilities coincide). This condition
looks very much like the Markov condition, which requires that each object
depends on the past only through its immediate predecessor. The condition
says that each object depends on the past only through its label.

Second, for any $y \in Y$, for any $n_1, n_2 \in \mathbb{N}$ and for any
measurable set $A \subset X$

\[ \mathbf{P}(X_{n_1} \in A \mid Y_{n_1} = y) = \mathbf{P}(X_{n_2} \in A \mid Y_{n_2} = y) \tag{2} \]

(i.e. the process is uniform in time; (1) allows dependence in $n$).

Note that the first condition means that objects are conditionally
independent given labels (on conditional independence see [7]). Under the
conditions (1) and (2) we say that objects are conditionally independent and
identically distributed (conditionally i.i.d.).

For each $y \in Y$ denote the distribution $\mathbf{P}(X_n \mid Y_n = y)$ by
$P_y$ (it does not depend on $n$ by (2)). Clearly, the distributions $P_0$ and
$P_1$ define some distributions $P$ on $X$ up to a parameter $p \in [0, 1]$.
That is, $P_p(A) = pP_1(A) + (1 - p)P_0(A)$ for any measurable set
$A \subset X$ and for each $p \in [0, 1]$. Thus with each distribution
$\mathbf{P}$ satisfying the assumptions (1) and (2) we will associate a family
of distributions $P_p$, $p \in [0, 1]$.
The assumptions of the conditional model can be also interpreted as follows.
Assume that we have some individual sequence $(y_n)_{n \in \mathbb{N}}$ of
labels and two probability distributions $P_0$ and $P_1$ on $X$, such that
there exist sets $X_0$ and $X_1$ in $X$ such that $P_1(X_1) = P_0(X_0) = 1$
and $P_0(X_1) = P_1(X_0) = 0$ (i.e. $X_0$ and $X_1$ define some function
$\eta$). Each example $x_n \in X$ is drawn according to the distribution
$P_{y_n}$; examples are drawn independently of each other.
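
As an illustration of this interpretation, the following Python sketch generates conditionally i.i.d. data from an arbitrary label sequence. The two class-conditional distributions (uniform on two disjoint intervals) and all function names are hypothetical choices made for the example; the text itself only requires that the supports be disjoint so that the label is a deterministic function of the object.

```python
import random
from typing import Callable, Iterable, List, Tuple


def sample_conditional(
    labels: Iterable[int],
    draw_object: Callable[[int, random.Random], float],
    seed: int = 0,
) -> List[Tuple[float, int]]:
    """Draw each object independently from P_y for its (arbitrary) label y."""
    rng = random.Random(seed)
    return [(draw_object(y, rng), y) for y in labels]


def draw_object(y: int, rng: random.Random) -> float:
    # Hypothetical class-conditional laws with disjoint supports:
    # P_0 is uniform on [0, 1), P_1 is uniform on [2, 3).
    return rng.random() + (2.0 if y == 1 else 0.0)


def eta(x: float) -> int:
    """Deterministic labelling function induced by the two supports."""
    return int(x >= 2.0)


labels = [0, 1] * 5 + [1, 1, 1, 0]  # any (possibly dependent) label sequence
data = sample_conditional(labels, draw_object)
label_rate = sum(y for _, y in data) / len(data)  # empirical rate of label 1
```

The label sequence is passed in as plain data, which reflects the point of the model: no probabilistic structure is assumed for the labels beyond their rates of occurrence.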

A predictor is a measurable function
$\Gamma_n := \Gamma(x_1, y_1, \ldots, x_n, y_n, x_{n+1})$ taking values in $Y$
(more formally, a family of functions indexed by $n$).
The probability of error of a predictor $\Gamma$ on each step $n$ is defined as

\[ \operatorname{err}_n(\Gamma, \mathbf{P}, z_1, \ldots, z_n) := \mathbf{P}\{(x, y) \in Z : y \ne \Gamma_n(z_1, \ldots, z_n, x)\} \]

($z_i$, $1 \le i \le n$ are fixed and the probability is taken over
$z_{n+1}$). We will sometimes omit some of the arguments of
$\operatorname{err}_n$ when it can cause no confusion; in particular, we will
often use a short notation
$\mathbf{P}(\operatorname{err}_n(\Gamma, Z_1, \ldots, Z_n) > \varepsilon)$
and an even shorter one $\mathbf{P}(\operatorname{err}_n(\Gamma) > \varepsilon)$
in place of

\[ \mathbf{P}\{z_1, \ldots, z_n : \operatorname{err}_n(\Gamma, \mathbf{P}, z_1, \ldots, z_n) > \varepsilon\}. \]

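For a pair of distributions $P_0$ and $P_1$ and any $\delta \in (0, 1/2)$
define

\[ \nabla_\delta(P_0, P_1, n, \varepsilon) := \sup_{p \in [\delta, 1-\delta]} P_p^\infty(\operatorname{err}_n(\Gamma) > \varepsilon). \tag{3} \]

For a predictor $\Gamma$ and a distribution $P$ on $X$ define

\[ \Delta(P, n, z_1, \ldots, z_n) := \max_{j \le \varkappa_n;\ \pi : \{1,\ldots,n\} \to \{1,\ldots,n\}} \left| \operatorname{err}_n(\Gamma, P^\infty, z_1, \ldots, z_n) - \operatorname{err}_{n-j}(\Gamma, P^\infty, z_{\pi(1)}, \ldots, z_{\pi(n-j)}) \right|. \]

Define the tolerance to data of $\Gamma$ as

\[ \Delta(P, n, \varepsilon) := P^n(\Delta(P, n, Z_1, \ldots, Z_n) > \varepsilon) \tag{4} \]

for any $n \in \mathbb{N}$, any $\varepsilon > 0$ and
$\varkappa_n := \sqrt{n \log n}$ (see the end of Section 5 for the discussion
of the choice of the constants $\varkappa_n$). Furthermore, for a pair of
distributions $P_0$ and $P_1$ and any $\delta \in (0, 1/2)$ define

\[ \Delta_\delta(P_0, P_1, n, \varepsilon) := \sup_{p \in [\delta, 1-\delta]} \Delta(P_p, n, \varepsilon). \]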
Tolerance to data means, in effect, that in any typical large portion of
data there is no small portion that changes strongly the probability of error. 
This property should also hold with respect to permutations. 

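We will also use another version of tolerance to data, in which instead of
removing some examples we replace them with an arbitrary sample
$z'_{n-j}, \ldots, z'_n$ consistent with $\eta$:

\[ \bar\Delta(P, n, z_1, \ldots, z_n) := \sup_{j \le \varkappa_n;\ \pi : \{1,\ldots,n\} \to \{1,\ldots,n\};\ z'_{n-j}, \ldots, z'_n} \left| \operatorname{err}_n(\Gamma, P^\infty, z_1, \ldots, z_n) - \operatorname{err}_n(\Gamma, P^\infty, \zeta_1, \ldots, \zeta_n) \right|, \]

where $\zeta_{\pi(i)} := z_{\pi(i)}$ if $i < n - j$ and
$\zeta_{\pi(i)} := z'_i$ otherwise; the maximum is taken over all $z'_i$,
$n - j \le i \le n$ consistent with $\eta$. Define

\[ \bar\Delta(P, n, \varepsilon) := P^n(\bar\Delta(P, n, Z_1, \ldots, Z_n) > \varepsilon) \]

and

\[ \bar\Delta_\delta(P_0, P_1, n, \varepsilon) := \sup_{p \in [\delta, 1-\delta]} \bar\Delta(P_p, n, \varepsilon). \]

The same notational convention will be applied to $\Delta$ and $\bar\Delta$
as to $\operatorname{err}_n$.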

Various notions similar to tolerance to data have been studied in the
literature. Perhaps they first appeared in connection with deleted or condensed
estimates (see e.g. j2Sl), and were later called stability (see jHllIlI for present 
studies of different kinds of stability, and for extensive overviews). Naturally, 
such notions arise when there is a need to study the behaviour of a predictor 
when some of the training examples are removed. These notions are very
similar to what we call tolerance to data, only we are interested in the
maximal deviation of probability of error while usually it is the average or
minimal deviations that are estimated.

A predictor developed to work in the off-line setting should be, loosely 
speaking, tolerant to small changes in a training sample. The next theorem 
shows under which conditions this property of a predictor can be utilised. 

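Theorem 1. Suppose that a distribution $\mathbf{P}$ generating examples is
such that the objects are conditionally i.i.d., i.e. $\mathbf{P}$ satisfies
(1) and (2). Fix some $\delta \in (0, 1/2]$, let
$p(n) := \frac{1}{n}\#\{i \le n : Y_i = 1\}$ and
$C_n := \mathbf{P}(\delta \le p(n) \le 1 - \delta)$ for each
$n \in \mathbb{N}$. Let also $\alpha_n := \frac{1}{1 - 1/\sqrt{n}}$. For any
predictor $\Gamma$ and any $\varepsilon > 0$ we have

\[ \mathbf{P}(\operatorname{err}_n(\Gamma) > \varepsilon) \le C_n^{-1}\alpha_n\big(\nabla_\delta(P_0, P_1, n + \varkappa_n, \delta\varepsilon/2) + \Delta_\delta(P_0, P_1, n + \varkappa_n, \delta\varepsilon/2)\big) + (1 - C_n), \tag{5} \]

and

\[ \mathbf{P}(\operatorname{err}_n(\Gamma) > \varepsilon) \le C_n^{-1}\alpha_n\big(\nabla_\delta(P_0, P_1, n, \delta\varepsilon/2) + \bar\Delta_\delta(P_0, P_1, n, \delta\varepsilon/2)\big) + (1 - C_n). \tag{6} \]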

The proofs for this section can be found in Appendix A. 

The theorem says that if we know with some confidence $C_n$ that the rate
of occurrence of each label is not less than some (small) $\delta$, and have
some bounds on the error rate and tolerance to data of a predictor in the
i.i.d. model, then we can obtain bounds on its error rate in the conditional
model.
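
To make the shape of this bound concrete, here is a small Python sketch that evaluates its right-hand side from given numbers. It assumes $\varkappa_n = \sqrt{n \log n}$ and takes $\alpha_n = 1/(1 - 1/\sqrt{n})$, which is our reading of the constant used in the proof in Appendix A; the function names and the numeric inputs are purely illustrative.

```python
import math


def kappa(n: int) -> float:
    """kappa_n = sqrt(n log n), the number of examples that may be removed."""
    return math.sqrt(n * math.log(n))


def alpha(n: int) -> float:
    """Assumed form alpha_n = 1 / (1 - 1/sqrt(n)); requires n > 1."""
    return 1.0 / (1.0 - 1.0 / math.sqrt(n))


def conditional_bound(nabla: float, tolerance: float, c_n: float, n: int) -> float:
    """Evaluate alpha_n * (nabla + tolerance) / C_n + (1 - C_n), capped at 1.

    `nabla` and `tolerance` stand for the i.i.d. error bound and the
    tolerance-to-data bound, both taken at level delta * eps / 2.
    """
    rhs = alpha(n) * (nabla + tolerance) / c_n + (1.0 - c_n)
    return min(1.0, rhs)


print(conditional_bound(nabla=0.01, tolerance=0.02, c_n=0.99, n=10_000))
```

The term $1 - C_n$ enters additively, so the bound cannot be better than the confidence we have in the label frequencies, however good the predictor is in the i.i.d. model.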

Thus we have a tool for estimating the performance of a predictor on each
finite step $n$. In Section 4 we will show how this result can be applied to
predictors minimising empirical risk. However, if we are only interested in
asymptotic results the formulations can be somewhat simplified.

Consider the following asymptotic condition on the frequencies of labels.
Define $p(n) := \frac{1}{n}\#\{i \le n : Y_i = 1\}$. We say that the rates of
occurrence of labels are bounded from below if there exists $\delta$,
$0 < \delta < 1/2$, such that

\[ \lim_{n \to \infty} \mathbf{P}(p(n) \in [\delta, 1 - \delta]) = 1. \tag{7} \]

As the condition (7) means $C_n \to 1$ we can derive from Theorem 1 the
following corollary.

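Corollary 1. Suppose that a distribution $\mathbf{P}$ satisfies (1), (2), and
(7) for some $\delta \in (0, 1/2]$. Let $\Gamma$ be such a predictor that

\[ \lim_{n \to \infty} \nabla_\delta(P_0, P_1, n, \varepsilon) = 0 \tag{8} \]

and either

\[ \lim_{n \to \infty} \Delta_\delta(P_0, P_1, n, \varepsilon) = 0 \tag{9} \]

or

\[ \lim_{n \to \infty} \bar\Delta_\delta(P_0, P_1, n, \varepsilon) = 0 \tag{10} \]

for any $\varepsilon > 0$. Then

\[ \mathbf{E}(\operatorname{err}_n(\Gamma, \mathbf{P}, Z_1, \ldots, Z_n)) \to 0. \]

In Section 3 we show how this statement can be applied to prove weak
consistency of some classical nonparametric predictors in the conditional
model.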
3 Application to classical nonparametric predictors

In this section we will consider two types of classical nonparametric predic- 
tors: partitioning and nearest neighbour classifiers. 

The nearest neighbour predictor assigns to a new object $x_{n+1}$ the label
of its nearest neighbour among $x_1, \ldots, x_n$:

\[ \Gamma_n(x_1, y_1, \ldots, x_n, y_n, x_{n+1}) := y_j, \]

where $j := \operatorname{argmin}_{i = 1, \ldots, n} \|x_{n+1} - x_i\|$.

For i.i.d. distributions this predictor is consistent, i.e.

\[ E^\infty(\operatorname{err}_n(\Gamma, P^\infty)) \to 0, \]

for any distribution P on X (see jH]). 
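
For concreteness, a minimal Python sketch of the (one) nearest neighbour rule follows. The Euclidean norm, the tie-breaking by first occurrence and the toy training set are assumptions of this example.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def nearest_neighbour(examples: List[Tuple[Point, int]], x: Point) -> int:
    """Return the label of the training object closest to x (Euclidean norm).

    Ties are broken in favour of the earliest example.
    """
    if not examples:
        raise ValueError("need at least one labelled example")
    _, label = min(examples, key=lambda ex: math.dist(ex[0], x))
    return label


train = [((0.0, 0.0), 0), ((1.0, 1.0), 1), ((0.9, 0.2), 1)]
print(nearest_neighbour(train, (0.1, 0.3)))
print(nearest_neighbour(train, (0.8, 0.6)))
```

The rule is symmetric in the training examples, a property that the proofs in Appendix B rely on.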
We generalise this result as follows. 

Theorem 2. Let $\Gamma$ be the nearest neighbour classifier. Let $\mathbf{P}$
be some distribution on $X^\infty$ satisfying (1), (2) and (7). Then

\[ \mathbf{E}(\operatorname{err}_n(\Gamma, \mathbf{P})) \to 0. \]

The proofs for this section can be found in Appendix B. 

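A partitioning predictor on each step $n$ partitions the object space
$X = \mathbb{R}^d$, $d \in \mathbb{N}$ into disjoint cells
$A^n_1, A^n_2, \ldots$ and classifies in each cell according to the majority
vote:

\[ \Gamma(z_1, \ldots, z_n, x) := \begin{cases} 0 & \text{if } \sum_{i=1}^n I_{y_i = 1} I_{x_i \in A(x)} \le \sum_{i=1}^n I_{y_i = 0} I_{x_i \in A(x)}, \\ 1 & \text{otherwise,} \end{cases} \]

where $A(x)$ denotes the cell containing $x$. Define

\[ \operatorname{diam}(A) := \sup_{x, y \in A} \|x - y\| \]

and

\[ N(x) := \sum_{i=1}^n I_{x_i \in A(x)}. \]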

It is a well known result (see, e.g. [TU]) that a partitioning predictor is 
weakly consistent, provided certain regularity conditions on the size of
cells hold. More precisely, let $\Gamma$ be a partitioning predictor such that
$\operatorname{diam}(A(X)) \to 0$ in probability and $N(X) \to \infty$ in
probability. Then for any distribution $P$ on $X$

\[ E^\infty(\operatorname{err}_n(\Gamma, P^\infty)) \to 0. \]
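
A partitioning predictor with cubic cells can be sketched as follows. The cubic partition with side `h`, the tie-breaking in favour of label 0 and the toy data are assumptions of this example; in a consistent scheme `h` would shrink with the sample size slowly enough for the cells to keep filling up.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Point = Sequence[float]


def cell(x: Point, h: float) -> Tuple[int, ...]:
    """Index of the cubic cell of side h that contains x."""
    return tuple(math.floor(coord / h) for coord in x)


def partition_predict(examples: List[Tuple[Point, int]], x: Point, h: float) -> int:
    """Majority vote among the examples falling into the cell of x.

    Ties, including an empty cell, are resolved in favour of label 0
    (an assumption of this sketch).
    """
    target = cell(x, h)
    votes = Counter(y for xi, y in examples if cell(xi, h) == target)
    return 0 if votes[1] <= votes[0] else 1


train = [((0.1, 0.1), 1), ((0.2, 0.3), 1), ((0.4, 0.4), 0), ((0.7, 0.8), 0)]
print(partition_predict(train, (0.3, 0.2), h=0.5))
print(partition_predict(train, (0.9, 0.9), h=0.5))
```

The partition here does not depend on the data, which matches the non-data-dependent rules treated in this section.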

We generalise this result to the case of conditionally i.i.d. examples as
follows.

Theorem 3. Let $\Gamma$ be a partitioning predictor such that
$\operatorname{diam}(A(X)) \to 0$ in probability and $N(X) \to \infty$ in
probability, for any distribution generating i.i.d. examples. Then

\[ \mathbf{E}(\operatorname{err}_n(\Gamma, \mathbf{P})) \to 0 \]

for any distribution $\mathbf{P}$ on $X^\infty$ satisfying (1), (2) and (7).

Observe that we only generalise results concerning weak consistency of 
(one) nearest neighbour and non-data-dependent partitioning rules. More 
general results exist (see e.g. [S1,|IH]), in particular for data-dependent rules. 
However, we do not aim to generalise state-of-the-art results in nonparametric 
classification, but rather to illustrate that weak consistency results can be 
extended to the conditional model. 

4 Application to Empirical Risk Minimisation

In this section we show how to estimate the performance of a predictor
minimising empirical risk (over a certain class of functions) using Theorem 1. To do
this we estimate the tolerance to data of such predictors, using some results 
from Vapnik-Chervonenkis theory. For the overviews of Vapnik-Chervonenkis 
theory see [23 12HI HDl • 

Let $X = \mathbb{R}^d$ for some $d \in \mathbb{N}$ and let $\mathcal{C}$ be a
class of measurable functions of the form $\varphi : X \to Y = \{0, 1\}$,
called decision functions. For a probability distribution $P$ on $X$ define
$\operatorname{err}(\varphi, P) := P(\varphi(X_i) \ne Y_i)$. If the examples
are generated i.i.d. according to some distribution $P$, the aim is to find a
function $\varphi$ from $\mathcal{C}$ for which
$\operatorname{err}(\varphi, P)$ is minimal:

\[ \varphi_P = \operatorname{argmin}_{\varphi \in \mathcal{C}} \operatorname{err}(\varphi, P). \]

In the theory of empirical risk minimisation this function is approximated
by the function

\[ \varphi^*_n := \operatorname{argmin}_{\varphi \in \mathcal{C}} \widehat{\operatorname{err}}_n(\varphi), \]

where $\widehat{\operatorname{err}}_n(\varphi) := \sum_{i=1}^n I_{\varphi(X_i) \ne Y_i}$
is the empirical error functional, based on a sample $(X_i, Y_i)$,
$i = 1, \ldots, n$. Thus,
$\Gamma_n(z_1, \ldots, z_n, x_{n+1}) := \varphi^*_n(x_{n+1})$ is a
predictor minimising empirical risk over the class of functions $\mathcal{C}$.
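
Empirical risk minimisation over a finite class can be sketched in Python as below. The class of threshold rules on the real line, the grid of thresholds and the sample are hypothetical and serve only to illustrate the definitions of the empirical error and its minimiser.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]
Rule = Callable[[float], int]


def empirical_error(rule: Rule, sample: Sequence[Example]) -> int:
    """Number of sample points that the rule misclassifies."""
    return sum(int(rule(x) != y) for x, y in sample)


def erm(rules: List[Rule], sample: Sequence[Example]) -> Rule:
    """Return a rule from the (finite) class with minimal empirical error."""
    return min(rules, key=lambda rule: empirical_error(rule, sample))


def threshold(t: float) -> Rule:
    """Hypothetical decision function: label 1 iff x >= t."""
    return lambda x: int(x >= t)


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
rules = [threshold(t) for t in grid]
sample = [(0.1, 0), (0.3, 0), (0.45, 0), (0.6, 1), (0.9, 1)]
best = erm(rules, sample)
print(empirical_error(best, sample), best(0.55), best(0.2))
```

The resulting predictor is obtained by re-running `erm` on the examples seen so far and applying the selected rule to the next object.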

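One of the basic results of Vapnik-Chervonenkis theory is the estimation
of the difference of probabilities of error between the best possible function
in the class ($\varphi_P$) and the function which minimises empirical error:

\[ P\big(\operatorname{err}_n(\Gamma, P^\infty) - \operatorname{err}(\varphi_P, P) > \varepsilon\big) \le 8 S(\mathcal{C}, n) e^{-n\varepsilon^2/128}, \]

where the symbol $S(\mathcal{C}, n)$ is used for the $n$-th shatter
coefficient of the class $\mathcal{C}$:

\[ S(\mathcal{C}, n) := \max_{A := \{x_1, \ldots, x_n\} \subset X} \#\{C \cap A : C \in \mathcal{C}\}. \]

Thus,

\[ P(\operatorname{err}_n(\Gamma) > \varepsilon) \le I_{\operatorname{err}(\varphi_P, P) > \varepsilon/2} + 8 S(\mathcal{C}, n) e^{-n\varepsilon^2/512}. \]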
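
To get a feeling for the sample sizes at which the first bound of this kind becomes informative, the following Python sketch evaluates $8 S(\mathcal{C}, n) e^{-n\varepsilon^2/128}$ in log scale. It relies on the standard estimate $S(\mathcal{C}, n) \le (n + 1)^V$ for a class of VC dimension $V$, which is an addition of this example and not part of the text; the function names are illustrative.

```python
import math


def log_vc_bound(n: int, eps: float, vc_dim: int) -> float:
    """Logarithm of 8 * (n + 1)**vc_dim * exp(-n * eps**2 / 128).

    Uses the estimate S(C, n) <= (n + 1)**V (an assumption of this sketch).
    """
    return math.log(8.0) + vc_dim * math.log(n + 1) - n * eps ** 2 / 128.0


def sample_size_for(eps: float, vc_dim: int, confidence: float) -> int:
    """Smallest power of two n for which the bound is at most 1 - confidence."""
    n = 1
    while log_vc_bound(n, eps, vc_dim) > math.log(1.0 - confidence):
        n *= 2
    return n


print(sample_size_for(eps=0.1, vc_dim=3, confidence=0.95))
```

Working with logarithms avoids overflow in `(n + 1)**vc_dim` and underflow in the exponential for large `n`.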

A particularly interesting case is when the optimal rule belongs to
$\mathcal{C}$, i.e. when $\eta \in \mathcal{C}$. This situation was
investigated in e.g. |7fl 13]. Obviously, in this case
$\varphi_P \in \mathcal{C}$ and $\operatorname{err}(\varphi_P, P) = 0$ for any
$P$. Moreover, a better bound exists (see [211131101)

\[ P(\operatorname{err}_n(\Gamma, P) > \varepsilon) \le 2 S(\mathcal{C}, n) e^{-n\varepsilon/2}. \]

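Theorem 4. Let $\mathcal{C}$ be a class of decision functions and let $\Gamma$
be a predictor which for each $n \in \mathbb{N}$ minimises
$\widehat{\operatorname{err}}_n$ over $\mathcal{C}$ on the observed examples
$(z_1, \ldots, z_n)$. Fix some $\delta \in (0, 1/2]$, let
$p(n) := \frac{1}{n}\#\{i \le n : Y_i = 0\}$ and
$C_n := \mathbf{P}(\delta \le p(n) \le 1 - \delta)$ for each
$n \in \mathbb{N}$. Assume $n > 4/\varepsilon^2$ and let
$\alpha_n := \frac{1}{1 - 1/\sqrt{n}}$. We have

\[ \Delta(P_0, P_1, n, \varepsilon) \le 16 S(\mathcal{C}, n) e^{-n\varepsilon^2/512} \tag{11} \]

(which does not depend on the distributions $P_0$ and $P_1$) and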



P(err„(r, P) > e) < l2err{^p^^^,P,/,)>e/2 (12) 

+16anC-'S{C,n)e--'''"/'''' + (1 - C„). 
If in addition $\eta \in \mathcal{C}$ then



A{n,e) < 4:S{C,2n) 



(13) 



and 



P(err„(r,P) > e) < AanC-'S{C,n)e-''''/'' + (1 - Cn). 



(14) 



Thus, if we have bounds on the VC dimension of some class of classifiers, 
we can obtain bounds on the performance of predictors minimising empirical 
error for the conditional model. 

Next we show how strong consistency results can be achieved in the con- 
ditional model. For general strong universal consistency results (with exam- 
ples) see [inii2Hii2ni- 

Denote the VC dimension of $\mathcal{C}$ by $V(\mathcal{C})$:

\[ V(\mathcal{C}) := \max\{n \in \mathbb{N} : S(\mathcal{C}, n) = 2^n\}. \]

Using Theorem 4 and the Borel-Cantelli lemma, we obtain the following
corollary.

Corollary 2. Let $\mathcal{C}^k$, $k \in \mathbb{N}$ be a sequence of classes
of decision functions

\[ \ldots\ \operatorname{err}(\Gamma, \mathbf{P})\ \ldots \quad \mathbf{P}\text{-a.s.}, \]

where $\Gamma$ is a predictor which in each trial $n$ minimises empirical risk
over $\mathcal{C}^{k_n}$ and $\mathbf{P}$ is any distribution satisfying (1),
(2) and $\sum_{n=1}^\infty (1 - C_n) < \infty$.

In particular, if we use a known bound on the VC dimension of classes of
neural networks, then we obtain the following corollary.

Corollary 3. Let $\Gamma$ be a classifier that minimises the empirical error
over the class $\mathcal{C}^{(k)}$, where $\mathcal{C}^{(k)}$ is the class of
neural net classifiers with $k$ nodes in the hidden layer and the threshold
sigmoid, and $k \to \infty$ so that $k \log n / n \to 0$ as $n \to \infty$.
Let $\mathbf{P}$ be any distribution on $X^\infty$ satisfying (1) and (2) such
that $\sum_{n=1}^\infty (1 - C_n) < \infty$. Then

\[ \lim_{n \to \infty} \operatorname{err}_n(\Gamma) = 0 \quad \mathbf{P}\text{-a.s.} \]
5 Discussion



We have introduced the "conditionally i.i.d." model for pattern recognition,
which generalises the commonly used i.i.d. model. Naturally, a question
arises whether our conditions on the distributions and on predictors are
necessary, or whether they can be generalised further in the same direction.
In this section we discuss the conditions of the new model from this point of
view.

The first question is, can the same results be obtained without assump- 
tions on tolerance to data? The following negative example shows that some 
bounds on tolerance to data are necessary. 

Remark 1. There exists a distribution P on X°° satisfying (QJj and ^ such 
that P{\pn - 1/2| > = for any n (i.e. C„ = 1 for any 6 G (0, 1/2) 
and n > jjj^zrg^) and a predictor T such that Pj^(err„ > 0) < 2-*^^" for any 
p G [5, 1 — 5] and P(err„ = 1) = 1 for n > 1. 

Proof. Let $X = Y = \{0, 1\}$. We define the distributions $P_y$ as
$P_y(X = y) = 1$, for each $y \in Y$ (i.e. $\eta(x) = x$ for each $x$). The
distribution $\mathbf{P}|_{Y^\infty}$ is defined as a Markov distribution with
transition probability matrix
$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, i.e. it always generates
sequences of labels ...01010101....
We define the predictor $\Gamma$ as follows:

\[ \Gamma_n := \begin{cases} 1 - x_n & \text{if } |\#\{i < n : y_i = 0\} - n/2| \le 1, \\ x_n & \text{otherwise.} \end{cases} \]

So, in the case when the distribution $\mathbf{P}$ is used to generate the
examples, $\Gamma$ is always seeing either $n - 1$ zeros and $n$ ones, or $n$
zeros and $n$ ones which, consequently, will lead it to always predict the
wrong label. It remains to note that this is highly improbable in the case of
an i.i.d. distribution. □
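
The effect can be observed in a small simulation, loosely following the construction in the proof. The exact balance test, the label frequency 0.2 of the i.i.d. source, the sequence length and the function names are assumptions of this sketch; it illustrates the idea and does not reproduce the constants of the remark.

```python
import random
from typing import List


def balanced_trap(labels_seen: List[int], x: int) -> int:
    """Predict the wrong label exactly when zeros and ones look balanced.

    Here eta(x) = x, so answering x is always correct and 1 - x always wrong.
    """
    zeros = labels_seen.count(0)
    balanced = abs(zeros - len(labels_seen) / 2) <= 1
    return 1 - x if balanced else x


def error_rate(labels: List[int]) -> float:
    """Fraction of online mistakes on a label sequence (objects equal labels)."""
    mistakes = 0
    for n in range(1, len(labels)):
        mistakes += int(balanced_trap(labels[:n], labels[n]) != labels[n])
    return mistakes / (len(labels) - 1)


alternating = [i % 2 for i in range(400)]             # ...010101...
rng = random.Random(0)
iid = [int(rng.random() < 0.2) for _ in range(400)]   # i.i.d. labels, P(1) = 0.2
print(error_rate(alternating), error_rate(iid))
```

On the alternating sequence the label counts are always balanced, so the predictor errs on every step, whereas on the i.i.d. sequence the counts drift apart quickly and mistakes are confined to the first few steps.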

Another point is the requirement on the frequencies of labels. In
particular, the assumption (7) might appear redundant: if the rate of
occurrence of some label tends to zero, can we just ignore this label without
affecting the asymptotic? It appears that this is not the case, as the
following example illustrates.

Remark 2. There exists a distribution $\mathbf{P}$ on $X^\infty$ which
satisfies (1) and (2) but for which the nearest neighbour predictor is not
consistent, i.e. the probability of error does not tend to zero.
Proof. Let $X = [0, 1]$, let $\eta(x) = 0$ if $x$ is rational and
$\eta(x) = 1$ otherwise. The distribution $P_1$ is uniform on the set of
irrational numbers, while $P_0$ is any distribution such that
$P_0(x) \ne 0$ for any rational $x$. (This construction is due to T. Cover.)
The nearest neighbour predictor is consistent for any i.i.d. distribution
which agrees with the definition, i.e. for any $p = P(Y = 1) \in [0, 1]$.

Next we construct the distribution $\mathbf{P}|_{Y^\infty}$. Fix some
$\varepsilon$, $0 < \varepsilon < 1$. Assume that according to $\mathbf{P}$
the first label is always 1 (i.e. $\mathbf{P}(y_1 = 1) = 1$; the object is an
irrational number). Next $k_1$ labels are always 0 (rationals), then follows
1, then $k_2$ zeros, and so on. It is easy to check that there exists such a
sequence $k_1, k_2, \ldots$ that with probability at least $\varepsilon$ we
have

\[ \max_{i < n:\ X_i \text{ is irrational}} P_1\{x : X_i \text{ is the nearest neighbour of } x\} \le \frac{1 - \varepsilon}{m(n)}, \]

where $m(n)$ is the total number of irrational objects up to the trial $n$. On
each step $n$ such that $n = t + \sum_{j=1}^{t} k_j$ for some
$t \in \mathbb{N}$ (i.e. on each irrational object) we have

\[ \mathbf{E}(\operatorname{err}_n(\Gamma, \mathbf{P})) \ge \varepsilon \Big( 1 - \sum_{j < n:\ X_j \text{ is irrational}} P_1(X_j \text{ is the nearest neighbour of } X) \Big) \ge \varepsilon^2. \]

As irrational objects are generated infinitely often (that is, with intervals
$k_i$), the probability of error does not tend to zero. □

Another question is whether the results can be generalised to the case of 
non-deterministically defined labels, which is often considered in literature. 
It should be noted that we consider the task of learning object-label depen- 
dence, ignoring the label-label dependence (and prohibiting any dependence 
apart from these). On one hand, it allows us to consider any sort of label- 
label dependence. On the other hand, the best bound on the probability 
of error we can obtain is the maximum of the class-conditional probabilities 
of error (as nothing is known about the probability of the next label), and 
not the so-called Bayes error, which is the best achievable bound in the i.i.d. 
case. 

Thus, if we want to consider stochastically defined labels, we should
restrict our attention to class-conditional probabilities of error. Some
obstacles can also be met along this way. In particular, the function $\eta$,
which in this case is defined as
$\eta(x) := \mathbf{P}(Y_n = 1 \mid X_n = x)$, should not depend on $n$, which
will require a more restrictive definition of the constants $C_n$ and of the
condition (7). We leave this question for further investigation.

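One more point which needs clarification is the choice of the constants
$\varkappa_n$. We have fixed these constants for the sake of simplicity of
notation; however, they can be made variable, as long as $\varkappa_n$ obeys
the following condition:

\[ \lim_{n \to \infty} I\{ n|p_n - p| \le \varkappa_n \} = 1 \]

almost surely for any $p \in (0, 1)$ and any probability distribution $P$ on
$X$ such that $P(y = 1) = p$, where
$p_n := \frac{1}{n}\#\{i \le n : Y_i = 0\}$.

Appendix A: proofs for Section 2

Before proceeding with the proof of Theorem 1 we give some definitions and
supplementary facts.

Define the conditional probabilities of error of $\Gamma$ as follows

\[ \operatorname{err}^0_n(\Gamma, \mathbf{P}, z_1, \ldots, z_n) := \mathbf{P}(Y_{n+1} \ne \Gamma(z_1, \ldots, z_n, X_{n+1}) \mid Y_{n+1} = 0), \]

\[ \operatorname{err}^1_n(\Gamma, \mathbf{P}, z_1, \ldots, z_n) := \mathbf{P}(Y_{n+1} \ne \Gamma(z_1, \ldots, z_n, X_{n+1}) \mid Y_{n+1} = 1) \]

(with the same notational convention as used with the definition of
$\operatorname{err}_n(\Gamma)$). In words, for each $y \in Y = \{0, 1\}$ we
define $\operatorname{err}^y_n$ as the probability of all $x \in X$, such that
$\Gamma$ makes an error on the $n$-th trial, given that $Y_{n+1} = y$ and
fixed $z_1, \ldots, z_n$.

For any $\mathbf{y} := (y_1, y_2, \ldots) \in Y^\infty$, define
$\mathbf{y}_n := (y_1, \ldots, y_n)$ and
$p_n(\mathbf{y}) := \#\{i \le n : y_i = 0\}$, for $n > 1$.

Clearly (from the assumption (1)) the random variables $X_1, \ldots, X_n$ are
mutually conditionally independent given $Y_1, \ldots, Y_n$, and by (2) they
are distributed according to $P_{Y_i}$, $1 \le i \le n$. Hence, the following
statement is valid.

Lemma 1. Fix some $n > 1$ and some $\mathbf{y} \in Y^\infty$ such that
$\mathbf{P}((Y_1, \ldots, Y_{n+1}) = \mathbf{y}_{n+1}) \ne 0$. Then

\[ \mathbf{P}\big(\operatorname{err}^{y_{n+1}}_n(\Gamma) > \varepsilon \mid (Y_1, \ldots, Y_n) = \mathbf{y}_n\big) = P^n_p\big(\operatorname{err}^{y_{n+1}}_n(\Gamma) > \varepsilon \mid (Y_1, \ldots, Y_n) = \mathbf{y}_n\big) \]

for any $p \in (0, 1)$.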
Proof of Theorem 1. Fix some $n > 1$, some $y \in Y$ and such
$\mathbf{y}^1 \in Y^\infty$ that
$n\delta \le p_n(\mathbf{y}^1) \le n(1 - \delta)$ and
$\mathbf{P}((Y_1, \ldots, Y_n) = \mathbf{y}^1_n) \ne 0$. Let
$p := p_n(\mathbf{y}^1)/n$.



for any y such that P(Yi = j/^, . . . , y„ = y.;^, F„+i = 7^ (recall that we use 
upper-case letters for random variables and lower-case for fixed variables, so 
that the probabilities in the above formula are labels-conditional). 

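Clearly, for $\delta \le p \le 1 - \delta$ we have
$\operatorname{err}_n(\Gamma, P_p) \le \max_{y \in Y}(\operatorname{err}^y_n(\Gamma, P_p))$,
and if $\operatorname{err}_n(\Gamma, P_p) < \varepsilon$ then
$\operatorname{err}^y_n(\Gamma, P_p) < \varepsilon/\delta$ for each $y \in Y$.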

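Let $m$ be such a number that $m - \varkappa_m = n$. For any
$\mathbf{y}^2 \in Y^\infty$ such that
$|m p_m(\mathbf{y}^2) - mp| \le \varkappa_m/2$ there exists a mapping
$\pi : \{1, \ldots, n\} \to \{1, \ldots, m\}$ such that
$y^2_{\pi(i)} = y^1_i$ for any $i \le n$. Define random variables
$X'_1, \ldots, X'_m$ as follows: $X'_{\pi(i)} := X_i$ for $i \le n$, while the
rest of $X'_i$ are some random variables independent from $X_1, \ldots, X_n$
and from each other, and distributed according to $P_p$ (a "ghost sample").
We have

\[ P^n_p\big(\operatorname{err}^y_n(X_1, y^1_1, \ldots, X_n, y^1_n) > \varepsilon\big) \]
\[ = P^m_p\big(\operatorname{err}^y_n(X_1, y^1_1, \ldots, X_n, y^1_n) - \operatorname{err}^y_m(X'_1, y^2_1, \ldots, X'_m, y^2_m) + \operatorname{err}^y_m(X'_1, y^2_1, \ldots, X'_m, y^2_m) > \varepsilon\big) \]
\[ \le P^m_p\big(|\operatorname{err}^y_m(X'_1, y^2_1, \ldots, X'_m, y^2_m) - \operatorname{err}^y_n(X_1, y^1_1, \ldots, X_n, y^1_n)| > \varepsilon/2\big) + P^m_p\big(\operatorname{err}^y_m(X'_1, y^2_1, \ldots, X'_m, y^2_m) > \varepsilon/2\big). \]

Observe that $\mathbf{y}^2$ was chosen arbitrarily (among sequences for which
$|m p_m(\mathbf{y}^2) - mp| \le \varkappa_m/2$) and
$(X_1, y^1_1, \ldots, X_n, y^1_n)$ can be obtained from
$(X'_1, y^2_1, \ldots, X'_m, y^2_m)$ by removing at most $\varkappa_m$
elements and applying some permutation. Thus the first term is bounded by

\[ P^m_p\Big( \max_{j \le \varkappa_m;\ \pi : \{1,\ldots,m\} \to \{1,\ldots,m\}} |\operatorname{err}^y_m(\Gamma, Z_1, \ldots, Z_m) - \operatorname{err}^y_{m-j}(\Gamma, Z_{\pi(1)}, \ldots, Z_{\pi(m-j)})| > \varepsilon/2 \ \Big|\ |m p(m) - mp| \le \varkappa_m/2 \Big) \]
\[ \le \frac{\Delta(P_p, m, \delta\varepsilon/2)}{P^m_p(|m p(m) - mp| \le \varkappa_m/2)} \le \frac{\Delta(P_p, m, \delta\varepsilon/2)}{1 - 1/\sqrt{m}}, \]

and the second term is bounded by
$\frac{1}{1 - 1/\sqrt{m}} P^m_p(\operatorname{err}_m(\Gamma) > \delta\varepsilon/2)$.
Hence

\[ P^n_p\big(\operatorname{err}^y_n(X_1, y^1_1, \ldots, X_n, y^1_n) > \varepsilon\big) \le \alpha_n\big(\Delta(P_p, m, \delta\varepsilon/2) + P^m_p(\operatorname{err}_m(\Gamma) > \delta\varepsilon/2)\big). \tag{15} \]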

Next we establish a similar bound in terms of A. For any G Y" such 
that |?T-Pn(y^) ~ np\ < x„/2 there exist such permutations tti,tt2 of the set 
{l,...,n} that ?/^^(.) = ?/^^(.) for any i < n - 5x„. Denote n - 6>in by n' 
and define random variables X[ . . . X'^ as follows: -^^2(0 "^tiC*) ^ — ^' 1 
while for < i < n X[ are some "ghost" random variables independent from 
Xi, . . . , X„ and from each other, and distributed according to Pp. We have 

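\[
\begin{aligned}
&P^n_p\big(\mathrm{err}^y_n(X_1,y^1_1,\dots,X_n,y^1_n) > \varepsilon\big)\\
&\quad\le P_p\big(|\mathrm{err}^y_n(X_1,y^1_1,\dots,X_n,y^1_n) - \mathrm{err}^y_n(X'_1,y^2_1,\dots,X'_n,y^2_n)| > \varepsilon/2\big)\\
&\qquad + P^n_p\big(\mathrm{err}^y_n(X'_1,y^2_1,\dots,X'_n,y^2_n) > \varepsilon/2\big).
\end{aligned}
\]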

Again, $y^2$ was chosen arbitrarily (among sequences for which $|n p_n(y^2) -
np| \le \varkappa_n/2$), and $(X_1,y^1_1,\dots,X_n,y^1_n)$ differs from $(X'_1,y^2_1,\dots,X'_n,y^2_n)$ in at most
$\varkappa_n$ elements, up to some permutation. Thus the first term is bounded by

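\[
\begin{aligned}
&P_p\Big(\sup_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\};\ z'_{n-j},\dots,z'_n}
|\mathrm{err}^y_n(Z_1,\dots,Z_n) - \mathrm{err}^y_n(\zeta_1,\dots,\zeta_n)| > \varepsilon/2
\;\Big|\; |np(n) - np| \le \varkappa_n/2\Big)\\
&\quad\le \alpha_n\,\bar\Delta(P_p,n,\delta\varepsilon/2),
\end{aligned}
\]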

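and the second term is bounded by $\alpha_n P^n_p(\mathrm{err}_n(\Gamma) > \delta\varepsilon/2)$. Hence

\[
P^n_p\big(\mathrm{err}^y_n(X_1,y^1_1,\dots,X_n,y^1_n) > \varepsilon\big)
\le \alpha_n\big(\bar\Delta(P_p,n,\delta\varepsilon/2) + P^n_p(\mathrm{err}_n(\Gamma) > \delta\varepsilon/2)\big). \tag{16}
\]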

Finally, as $y^1$ was chosen arbitrarily among sequences $y \in Y^\infty$ such that
$n\delta \le n p_n(y^1) \le n(1-\delta)$, from (15) and (16) we obtain the two claimed bounds. □

Appendix B: proofs for Section El 

The first part of the proof is common to both theorems. Let us fix some
distribution $P$ satisfying the conditions of the theorems. It is enough to show
that

\[
\sup_{p\in[\delta,1-\delta]} E^\infty\big(\mathrm{err}_n(\Gamma,P_p,Z_1,\dots,Z_n)\big) \to 0
\]

and

\[
\sup_{p\in[\delta,1-\delta]} E^\infty\big(\bar\Delta(P_p,n,Z_1,\dots,Z_n)\big) \to 0
\]

for the nearest neighbour and partitioning predictors, and then apply the Corollary.

Observe that both predictors are symmetric, i.e. do not depend on the
order of $Z_1,\dots,Z_n$. Thus, for any $z_1,\dots,z_n$

\[
\bar\Delta(P_p,n,z_1,\dots,z_n) = \sup_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\};\ z'_{n-j},\dots,z'_n}
|\mathrm{err}_n(\Gamma,P_p,z_1,\dots,z_n) - \mathrm{err}_n(\Gamma,P_p,z_{\pi(1)},\dots,z_{\pi(n-j)},z'_{n-j},\dots,z'_n)|,
\]

where the maximum is taken over all $z'_i$ consistent with $\eta$, $n-j < i \le n$.
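Define also the class-conditional versions of $\bar\Delta$:

\[
\bar\Delta^y(P_p,n,z_1,\dots,z_n) := \sup_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\};\ z'_{n-j},\dots,z'_n}
|\mathrm{err}^y_n(\Gamma,P_p,z_1,\dots,z_n) - \mathrm{err}^y_n(\Gamma,P_p,z_{\pi(1)},\dots,z_{\pi(n-j)},z'_{n-j},\dots,z'_n)|.
\]

Note that (omitting $Z_1,\dots,Z_n$ from the notation) $\mathrm{err}_n(\Gamma,P_p) \le \mathrm{err}^0_n(\Gamma,P_p) +
\mathrm{err}^1_n(\Gamma,P_p)$ and $\bar\Delta(P_p,n) \le \bar\Delta^0(P_p,n) + \bar\Delta^1(P_p,n)$. Thus, it is enough to show
that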

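\[
\sup_{p\in[\delta,1-\delta]} E^\infty\big(\mathrm{err}^1_n(\Gamma,P_p)\big) \to 0 \tag{17}
\]

and

\[
\sup_{p\in[\delta,1-\delta]} E^\infty\big(\bar\Delta^1(P_p,n)\big) \to 0. \tag{18}
\]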

Observe that for each of the predictors in question the probability of error 
given that the true label is 1 will not decrease if an arbitrary (possibly large) 
portion of training examples labelled with ones is replaced with an arbitrary 
(but consistent with $\eta$) portion of the same size of examples labelled with
zeros. Thus, for any $n$ and any $p \in [\delta, 1-\delta]$ we can decrease the proportion of
ones in our sample (by replacing the corresponding examples with examples
from the other class) down to (say) $\delta/2$, without decreasing the probability of
error on examples labelled with 1. So,

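\[
E^\infty\big(\mathrm{err}^1_n(\Gamma,P_p)\big) \le E^\infty\big(\mathrm{err}^1_n(\Gamma,P_{\delta/2}) \mid p_n = \delta/2\big) + P_p(p_n < \delta/2), \tag{19}
\]

where as usual $p_n := \frac{1}{n}\#\{i \le n : y_i = 1\}$. Obviously, the last term (quickly)
tends to zero. Moreover, it is easy to see that

\[
\begin{aligned}
E^\infty\big(\mathrm{err}^1_n(\Gamma,P_{\delta/2}) \mid p_n = \delta/2\big)
&\le E^\infty\big(\mathrm{err}^1_n(\Gamma,P_{\delta/2}) \mid |n\delta/2 - n p_n| \le \varkappa_n/2\big) + E^\infty\big(\bar\Delta^1(P_{\delta/2},n)\big)\\
&\le \frac{1}{1-1/\sqrt{n}}\,E^\infty\big(\mathrm{err}^1_n(\Gamma,P_{\delta/2})\big) + E^\infty\big(\bar\Delta^1(P_{\delta/2},n)\big).
\end{aligned} \tag{20}
\]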

The first term tends to zero, as is known from the results for i.i.d. processes;
thus, to establish (17) we have to show that

\[
E\big(\bar\Delta^1(P_p,n,Z_1,\dots,Z_n)\big) \to 0 \tag{21}
\]

for any $p \in (0,1)$.

We will also show that (21) is sufficient to prove (18). Indeed,

\[
\bar\Delta^1(P_p,n,z_1,\dots,z_n) \le \mathrm{err}^1_n(\Gamma,P_p,z_1,\dots,z_n)
+ \sup_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\};\ z'_{n-j},\dots,z'_n}
\mathrm{err}^1_n(\Gamma,P_p,z_{\pi(1)},\dots,z_{\pi(n-j)},z'_{n-j},\dots,z'_n).
\]

Denote the last summand by $D$. Again, we observe that $D$ will not decrease
if an arbitrary (possibly large) portion of training examples labelled with
ones is replaced with an arbitrary (but consistent with $\eta$) portion of the
same size of examples labelled with zeros. Introduce $\tilde\Delta^1(P_p,n,Z_1,\dots,Z_n)$ as
$\bar\Delta^1(P_p,n,z_1,\dots,z_n)$ with $\varkappa_n$ in the definition replaced by a fixed multiple of $\varkappa_n$. Using the
same argument as in (19) and (20) we have

\[
E^\infty(D) \le \frac{1}{1-1/\sqrt{n}}\Big(E^\infty\big(\tilde\Delta^1(P_{\delta/2},n)\big) + E^\infty\big(\mathrm{err}^1_n(\Gamma,P_{\delta/2})\big)\Big) + P_p(p_n < \delta/2).
\]

Thus, (18) holds true if (21) and

\[
E^\infty\big(\tilde\Delta^1(P_p,n,Z_1,\dots,Z_n)\big) \to 0 \tag{22}
\]

hold. Finally, we will prove (21); it will be seen that the proof of (22) is analogous
(i.e. replacing $\varkappa_n$ by a fixed multiple of $\varkappa_n$ does not affect the proof). Note that



\[
E^\infty\big(\bar\Delta(P_p,n,Z_1,\dots,Z_n)\big) \le P_p\Big(\sup_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\};\ z'_{n-j},\dots,z'_n}
\mathrm{err}_n(\Gamma,P_p,Z_1,\dots,Z_n) \ne \mathrm{err}_n(\Gamma,P_p,Z_{\pi(1)},\dots,Z_{\pi(n-j)},z'_{n-j},\dots,z'_n)\Big),
\]

where the maximum is taken over all $z'_i$ consistent with $\eta$, $n-j < i \le n$.
The last expression should be shown to tend to zero. This we will prove for
each of the predictors separately.

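Nearest Neighbour predictor. Fix some distribution $P_p$, $0 < p < 1$, and
some $\varepsilon > 0$. Fix also some $n \in \mathbb{N}$ and define (leaving $x_1,\dots,x_n$ implicit)

\[
B_n(x) := P_p\{t \in X : t \text{ and } x \text{ have the same nearest neighbour among } x_1,\dots,x_n\}
\]

and $\bar B_n := E(B_n(X))$. Note that $E^\infty(\bar B_n) = 1/n$, where the expectation is
taken over $X_1,\dots,X_n$. Define $\mathcal{B} := \{(x_1,\dots,x_n) \in X^n : \bar B_n \le 1/(n\varepsilon)\}$ and
$A(x_1,\dots,x_n) := \{x : B_n(x) \le 1/(n\varepsilon^2)\}$. Applying Markov's inequality twice,
we obtain

\[
\begin{aligned}
E^\infty\big(\bar\Delta(P_p,n)\big) &\le E^\infty\big(\bar\Delta(P_p,n) \mid (X_1,\dots,X_n) \in \mathcal{B}\big) + \varepsilon\\
&\le E^\infty\Big(\sup_{j\le\varkappa_n;\ \pi;\ z'_{n-j},\dots,z'_n}
P_p\big\{x : \mathrm{err}_n(\Gamma,P_p,Z_1,\dots,Z_n) \ne \mathrm{err}_n(\Gamma,P_p,Z_{\pi(1)},\dots,Z_{\pi(n-j)},z'_{n-j},\dots,z'_n)\\
&\qquad\qquad \big|\; x \in A(X_1,\dots,X_n)\big\} \;\Big|\; (X_1,\dots,X_n) \in \mathcal{B}\Big) + 2\varepsilon.
\end{aligned} \tag{23}
\]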

Removing one point $x_i$ from a sample $x_1,\dots,x_n$ we can only change the
value of $\Gamma$ in the area

\[
\{x \in X : x_i \text{ is the nearest neighbour of } x\} = B_n(x_i),
\]

while adding one point $x_0$ to the sample we can change the value of $\Gamma$ in the
area

\[
D_n(x_0) := \{x \in X : x_0 \text{ is the nearest neighbour of } x\}.
\]

It can be shown that the number of examples (among $x_1,\dots,x_n$) for which
a point $x_0$ is the nearest neighbour is not greater than a constant $\gamma$ which
depends only on the space $X$ (see [10], Corollary 11.1). Thus,

\[
D_n(x_0) \subset \bigcup_{j=j_1,\dots,j_\gamma} B_n(x_j)
\]

for some $j_1,\dots,j_\gamma$, and so

\[
E^\infty\big(\bar\Delta(P_p,n)\big) \le 2\varepsilon + 2(\gamma+1)\varkappa_n E^\infty\Big(\max_{x\in A(X_1,\dots,X_n)} B_n(x) \;\Big|\; (X_1,\dots,X_n) \in \mathcal{B}\Big)
\le 2\varkappa_n\frac{\gamma+1}{n\varepsilon^2} + 2\varepsilon,
\]

which, increasing $n$, can be made less than $3\varepsilon$. □
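The geometric fact behind the constant $\gamma$ can be checked by brute force in a simple setting. The sketch below assumes the Euclidean plane and strict inequalities (no ties). In that setting any two points sharing the nearest neighbour $x_0$ subtend an angle of more than 60 degrees at $x_0$, so the count cannot exceed 5. The function name is hypothetical.

```python
import math
import random


def count_points_with_nn(x0, points):
    """Count the points whose nearest neighbour, among x0 and the other points, is x0."""
    count = 0
    for i, p in enumerate(points):
        d0 = math.dist(p, x0)
        nearest_other = min(
            (math.dist(p, q) for j, q in enumerate(points) if j != i),
            default=math.inf,
        )
        if d0 < nearest_other:  # strict: ties are not counted
            count += 1
    return count


# Extremal configuration: a regular pentagon around x0 (side length > radius).
pentagon = [(math.cos(2 * math.pi * k / 5), math.sin(2 * math.pi * k / 5)) for k in range(5)]
pentagon_count = count_points_with_nn((0.0, 0.0), pentagon)

# A random sample never exceeds the planar bound of 5.
random.seed(0)
sample = [(random.random(), random.random()) for _ in range(300)]
random_count = count_points_with_nn((0.5, 0.5), sample)
```

The bound does not grow with the sample size, which is what allows the area affected by adding one point to be covered by a bounded number of the regions $B_n(x_j)$.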
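Partitioning predictor. For any measurable sets $B \subset X^n$ and $A \subset X$
define

\[
\begin{aligned}
D(B,A) := E^\infty\Big(&\sup_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\};\ z'_{n-j},\dots,z'_n}
P_p\big\{x : \mathrm{err}_n(\Gamma,P_p,Z_1,\dots,Z_n) \ne \mathrm{err}_n(\Gamma,P_p,Z_{\pi(1)},\dots,Z_{\pi(n-j)},z'_{n-j},\dots,z'_n)\\
&\big|\; x \in A\big\} \;\Big|\; (X_1,\dots,X_n) \in B\Big) + 2\varepsilon
\end{aligned}
\]

and $D := D(X^n, X)$.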

Fix some distribution $P_p$, $0 < p < 1$, and some $\varepsilon > 0$. Introduce

\[
\hat\eta_n(x, X_1,\dots,X_n) := \frac{1}{N(x)}\sum_{i=1}^{n} I_{Y_i=1}\, I_{X_i \in A(x)}
\]

($X_1,\dots,X_n$ will usually be omitted). From the consistency results for the i.i.d.
model (see, e.g., [10], Theorem 6.1) we know that $E^{n+1}|\hat\eta_n(X) - \eta(X)| \to 0$
(the upper index in $E^{n+1}$ indicating the number of examples it is taken over).

Thus, $E|\hat\eta_n(X) - \eta(X)| \le \varepsilon^4$ from some $n$ on. Fix any such $n$ and let
$B := \{(x_1,\dots,x_n) : E|\hat\eta_n(X) - \eta(X)| \le \varepsilon^2\}$. By Markov's inequality we obtain
$P_p(B) \ge 1 - \varepsilon^2$. For any $(x_1,\dots,x_n) \in B$ let $\mathcal{A}(x_1,\dots,x_n)$ be the union of all
cells $A$ for which $E(|\hat\eta_n(X) - \eta(X)| \mid X \in A) \le \varepsilon$. Clearly, with $x_1,\dots,x_n$
fixed, $P_p(X \in \mathcal{A}(x_1,\dots,x_n)) \ge 1 - \varepsilon$. Moreover, $D \le D(B,\mathcal{A}) + \varepsilon + \varepsilon^2$.

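Fix $\mathcal{A} := \mathcal{A}(x_1,\dots,x_n)$ for some $(x_1,\dots,x_n) \in B$. Since $\eta(x)$ is always
either 0 or 1, to change a decision in any cell $A \subset \mathcal{A}$ we need to add or
remove at least $(1-\varepsilon)N(A)$ examples, where $N(A) := N(x)$ for any $x \in A$.
Let $N(n) := E(N(X))$ and $A(n) := E(P_p(A(X)))$. Clearly, $\frac{N(n)}{nA(n)} = 1$ for any
$n$, as $E\frac{N(X)}{n} = A(n)$.

As before, using Markov's inequality and shrinking $\mathcal{A}$ if necessary we can
have $P_p\big(\frac{nA(n)}{N(X)} \le \frac{1}{\varepsilon} \mid X \in \mathcal{A}\big) = 1$, $P_p\big(\frac{P_p(A(X))}{A(n)} \le \frac{1}{\varepsilon} \mid X \in \mathcal{A}\big) = 1$, and $D \le
D(B,\mathcal{A}) + 3\varepsilon + \varepsilon^2$. Thus, for all cells $A \subset \mathcal{A}$ we have $N(A) \ge \varepsilon n A(n)$, so
that the probability of error can be changed in at most $\frac{2\varkappa_n}{(1-\varepsilon)\varepsilon n A(n)}$ cells; but
the probability of each cell is not greater than $\frac{A(n)}{\varepsilon}$. Hence

\[
E^\infty\big(\bar\Delta(P_p,n)\big) \le \frac{2\varkappa_n}{n(1-\varepsilon)\varepsilon^2} + 3\varepsilon + \varepsilon^2. \qquad □
\]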
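The counting step of the argument (a cell's decision can change only if sufficiently many examples in it are added or removed) can be illustrated with a toy partitioning rule. The sketch assumes a one-dimensional partition into intervals of equal width and majority voting with ties resolved to 0. These choices and all function names are assumptions made for the illustration.

```python
from collections import defaultdict


def fit_histogram_rule(xs, ys, width):
    """Count labels per cell of a 1-D partition into intervals of length `width`."""
    counts = defaultdict(lambda: [0, 0])  # cell index -> [#zeros, #ones]
    for x, y in zip(xs, ys):
        counts[int(x // width)][y] += 1
    return counts


def predict(counts, x, width):
    """Majority vote in the cell containing x (ties and empty cells give 0)."""
    n0, n1 = counts.get(int(x // width), (0, 0))
    return 1 if n1 > n0 else 0


def edits_to_flip(counts, cell):
    """Fewest examples that must be added or removed in `cell` to change its decision."""
    n0, n1 = counts.get(cell, (0, 0))
    # Each added or removed example moves the vote margin by one.
    return n1 - n0 if n1 > n0 else n0 - n1 + 1


xs = [0.1, 0.2, 0.3, 1.5, 1.6]
ys = [1, 1, 1, 0, 0]
counts = fit_histogram_rule(xs, ys, width=1.0)
```

In a cell whose examples all carry one label, the number of required edits is of the order of the cell's occupancy, so a budget of about $2\varkappa_n$ edits can affect only a limited number of well-populated cells.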
Appendix C: proofs for Section 31

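Proof of the theorem. Fix some probability distribution $P_p$ and some $n \in \mathbb{N}$.
Let $\varphi^\times$ be any decision rule $\varphi \in C$ picked by $\Gamma_{n-\varkappa_n}$ on which (along with the
corresponding permutation) the maximum

\[
\max_{j\le\varkappa_n;\ \pi:\{1,\dots,n\}\to\{1,\dots,n\}} |\mathrm{err}_n(\Gamma,z_1,\dots,z_n) - \mathrm{err}_{n-j}(\Gamma,z_{\pi(1)},\dots,z_{\pi(n-j)})|
\]

is reached. We need to estimate $P^n(|\mathrm{err}(\varphi^*) - \mathrm{err}(\varphi^\times)| > \varepsilon)$.

Clearly, $|\overline{\mathrm{err}}_n(\varphi^\times) - \overline{\mathrm{err}}_n(\varphi^*)| \le \varkappa_n$, as $\varkappa_n$ is the maximal number of errors
which can be made on the difference of the two samples.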

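Moreover,

\[
\begin{aligned}
P^n\big(|\mathrm{err}(\varphi^*) - \mathrm{err}(\varphi^\times)| > \varepsilon\big)
&\le P^n\big(|\mathrm{err}(\varphi^*) - \tfrac{1}{n}\overline{\mathrm{err}}_n(\varphi^*)| > \varepsilon/2\big)\\
&\quad + P^n\big(|\tfrac{1}{n}\overline{\mathrm{err}}_n(\varphi^\times) - \mathrm{err}(\varphi^\times)| > \varepsilon/2 - \varkappa_n/n\big).
\end{aligned}
\]

Observe that

\[
P^n\Big(\sup_{\varphi\in C} |\tfrac{1}{n}\overline{\mathrm{err}}_n(\varphi) - \mathrm{err}(\varphi)| > \varepsilon\Big) \le 8S(C,n)e^{-n\varepsilon^2/32}, \tag{24}
\]

see [10], Theorem 12.6. Thus,

\[
\Delta(P_p,n,\varepsilon) \le 16S(C,n)e^{-n(\varepsilon/2-\varkappa_n/n)^2/32} \le 16S(C,n)e^{-n\varepsilon^2/512}
\]

for all sufficiently large $n$ (it suffices that $\varkappa_n/n \le \varepsilon/4$). So,

\[
P(\mathrm{err}_n(\Gamma,P) > \varepsilon) \le I_{\sup_{p\in[\delta,1-\delta]}\mathrm{err}(\varphi_{P_p},P_p) > \varepsilon/2}
+ 16\alpha_n C_n^{-1} S(C,n)e^{-n\delta^2\varepsilon^2/2048} + (1 - C_n).
\]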

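It remains to notice that

\[
\begin{aligned}
\mathrm{err}(\varphi_{P_p},P_p) &= \inf_{\varphi\in C}\big(p\,\mathrm{err}^1(\varphi,P_p) + (1-p)\,\mathrm{err}^0(\varphi,P_p)\big)\\
&\le \inf_{\varphi\in C}\big(\mathrm{err}^1(\varphi,P_{1/2}) + \mathrm{err}^0(\varphi,P_{1/2})\big) = 2\,\mathrm{err}(\varphi_{P_{1/2}},P_{1/2})
\end{aligned}
\]

for any $p \in [0,1]$.

So far we have proven (11) and (12); (13) and (14) can be proven analogously,
only for the case $\eta \in C$ we have

\[
P^n\Big(\sup_{\varphi\in C} |\tfrac{1}{n}\overline{\mathrm{err}}_n(\varphi) - \mathrm{err}(\varphi)| > \varepsilon\Big) \le S(C,n)e^{-n\varepsilon}
\]

instead of (24), and $\mathrm{err}(\varphi_{P_p},P_p) = 0$. □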
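To get a feeling for the sample sizes at which a bound of the form (24) becomes informative, one can evaluate it numerically. The sketch below is an illustration under stated assumptions: the constants 8 and 32 are those of the Vapnik-Chervonenkis inequality as cited from [10], the class $C$ is assumed to have finite VC dimension `vc_dim`, and the shatter coefficient $S(C,n)$ is replaced by the standard upper bound $(n+1)^{\mathrm{vc\_dim}}$. Function names are hypothetical.

```python
import math


def vc_tail_bound(n, eps, vc_dim):
    """Evaluate 8 * (n + 1)**vc_dim * exp(-n * eps**2 / 32), capped at 1.

    (n + 1)**vc_dim stands in for the shatter coefficient S(C, n).
    """
    log_bound = math.log(8.0) + vc_dim * math.log(n + 1.0) - n * eps ** 2 / 32.0
    return math.exp(min(log_bound, 0.0))  # values above 1 are vacuous for a probability


def find_sample_size(eps, vc_dim, target, n_max=10 ** 12):
    """Return the first power of two n with vc_tail_bound(n) <= target, or None."""
    n = 1
    while n <= n_max:
        if vc_tail_bound(n, eps, vc_dim) <= target:
            return n
        n *= 2
    return None


n_needed = find_sample_size(eps=0.1, vc_dim=3, target=0.05)
```

The polynomial factor is eventually dominated by the exponential one, but the constant 32 in the exponent means the bound is vacuous for small samples; the same applies, with larger constants, to the bounds derived from it above.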






References 



[1] D. Aldous and U. Vazirani A Markovian extension of Valiant's learn- 
ing model. In Proceedings of the 31st Symposium on Foundations of 
Computer Science, pp. 392-396, 1990. 

[2] P. Algoet, Universal schemes for learning the best nonlinear predictor 
given the infinite past and side information IEEE Transactions on In- 
formation Theory, Vol. 45, No. 4, 1999. 

[3] P. Bartlett, S. Ben-David, S. Kulkarni, Learning changing concepts by
exploiting the structure of change. In Proceedings of the Workshop on
Computational Learning Theory, pp. 131-139, Morgan Kaufmann Publishers, 1996.

[4] E. Baum and D. Haussler, What size net gives valid generalisation? 
Neural Computation, 1:151-160, 1989. 

[5] A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability
and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36, pp.
929-965, 1989.

[6] O. Bousquet, A. Elisseeff. Stability and Generalization. Journal of Ma- 
chine Learning Research, 2: 499-526, 2002. 

[7] A. P. Dawid Conditional Independence in Statistical Theory. Journal of 
the Royal Statistical Society, Series B (Methodological), Vol. 41 No 1, 
pp. 1-31, 1979 

[8] L. Devroye, On asymptotic probability of error in nonparametric dis- 
crimination. Annals of Statistics, 9:1320-1327. 

[9] L. Devroye, L. Gyorfi, A. Krzyzak, G. Lugosi, On the strong universal 
consistency of nearest neighbor regression function estimates. Annals of 
Statistics, Vol. 22, pp. 1371-1385, 1994. 

[10] L. Devroye, L. Gyorfi, G. Lugosi, A probabilistic theory of pattern recog- 
nition. New York: Springer, 1996. 

[11] L. Gyorfi, G. Lugosi, G. Morvai. A simple randomized algorithm for 
sequential prediction of ergodic time series. IEEE Transactions on Infor- 
mation Theory, vol.45, pp. 2642-2650, 1999. 






[12] D. Helmbold and P. Long, Tracking drifting concepts by minimizing dis- 
agreements. Proceedings of the fourth annual workshop on Computa- 
tional learning theory, Santa Cruz, USA, pp. 13-23, 1991. 

[13] D. Gamarnik, Extension of the PAC framework to finite and countable
Markov chains. IEEE Transactions on Information Theory, 49(1):338-345,
2003.

[14] M. Kearns and D. Ron, Algorithmic stability and sanity-check bounds on 
leave-one-out cross-validation. Neural Computation, 11(6): 1427-1453, 
1999. 

[15] M. Kearns and U. Vazirani, An Introduction to Computational Learning
Theory. The MIT Press, Cambridge, Massachusetts, 1994.

[16] S. Kulkarni, S. Posner. Rates of Convergence of Nearest Neighbour Es- 
timation Under Arbitrary Sampling. IEEE Transactions on Information 
Theory, Vol. 41, No. 10, pp. 1028-1039, 1995. 

[17] S. Kulkarni, S. Posner, S. Sandilya. Data-Dependent kn-NN and Kernel
Estimators Consistent for Arbitrary Processes. IEEE Transactions on
Information Theory, Vol. 48, No. 10, pp. 2785-2788, 2002.

[18] G. Lugosi, A. Nobel, Consistency of data-driven histogram methods for 
density estimation and classification. Annals of Statistics vol. 24, No. 2, 
pp.687-706, 1996. 

[19] G. Lugosi and K. Zeger Nonparametric Estimation via empirical risk 
minimization IEEE Transactions on Information Theory, Vol 41 No 3 
pp. 677-687, 1995. 

[20] G. Morvai, S. Kulkarni, and A.B. Nobel, Regression estimation from an 
individual stable sequence, Statistics, vol. 33, pp. 99-118, 1999. 

[21] G. Morvai, S. Yakowitz, P. Algoet, Weakly Convergent Nonparametric 
Forecasting of Stationary Time Series IEEE Transactions on Informa- 
tion Theory, Vol. 43, No. 2, 1997 

[22] A.B. Nobel, Limits to classification and regression estimation from er- 
godic process, Annals of Statistics, vol. 27, pp. 262-273, 1999. 






[23] W. Rogers and T. Wagner. A finite sample distribution-free performance 
bound for local discrimination rules. Annals of Statistics, Vol 6 No 3 pp. 
506-514, 1978. 

[24] B. Ryabko, Prediction of random sequences and universal coding. Prob- 
lems of Information Transmission, Vol. 24, pp. 87-96, 1988. 

[25] D. Ryabko, Online Learning of Conditionally I.I.D. Data. Proceedings of
the 21st International Conference on Machine Learning, Banff, Canada,
pp. 727-734, 2004.

[26] D. Ryabko, Application of Classical Nonparametric Predictors to Learning
Conditionally I.I.D. Data. Proceedings of the 15th International Conference
on Algorithmic Learning Theory, Padova, Italy, pp. 171-180,
2004.

[27] L. Valiant, A theory of the learnable. Communications of the ACM, 27,
pp. 1134-1142, 1984.

[28] V. Vapnik, Statistical Learning Theory, New York etc.: John Wiley & 
Sons, Inc. 1998 

[29] V. Vapnik, and A. Chervonenkis. Theory of Pattern Recognition. Nauka, 
Moscow, 1974 (in Russian). 






